Model Overview

The model implements a free-operant reinforcement learning process over fine-grained time bins of width \(\Delta t\) (typically 50 ms). Within each bin the agent may emit at most one response. Actions are denoted \(a \in \{x, y\}\), and outcomes (rewards) can occur in the same or later bins after a stochastic delay.

Time and State

Eligibility traces use a Dutch formulation with per-bin decay \[\lambda_{\text{bin}} = \exp\left(-\frac{\Delta t}{\tau_e}\right).\] State variables per bin \(b\) include the action values \(Q_b(a)\), Dutch traces \(z_b(a)\), the reward \(r_b\), the running average reward \(\bar{R}_b\), and a one-hot action vector \(\mathbf{x}_b\) capturing the response that occurred in the bin (or zeros if no response).

Learning Updates

Temporal-difference errors and updates follow: \[ \delta_b = r_b - \bar{R}_b, \] \[ \mathbf{z}_{b+1} = \lambda_{\text{bin}} \mathbf{z}_b + \left(1 - \alpha \lambda_{\text{bin}} \mathbf{z}_b^\top \mathbf{x}_b\right) \mathbf{x}_b, \] \[ \mathbf{Q}_{b+1} = \mathbf{Q}_b + \alpha \, \delta_b \, \mathbf{z}_b, \] \[ \bar{R}_{b+1} = (1 - \eta)\bar{R}_b + \eta r_b. \] Here \(\alpha\) is the per-bin learning rate, \(\eta\) controls how quickly the average-reward baseline tracks recent outcomes, and \(\tau_e\) sets the eligibility horizon in continuous time.
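These per-bin updates can be sketched in a few lines (a minimal Python illustration of the equations above; the function name and argument layout are my own, not the simulator's):

```python
import numpy as np

def bin_update(Q, z, x, r, R_bar, alpha, eta, lam_bin):
    """One fine-grained time-bin update with Dutch traces.

    Q, z, x are length-2 arrays (one entry per action); x is one-hot,
    or all zeros if no response occurred this bin.
    """
    delta = r - R_bar                                    # TD error vs. average-reward baseline
    Q = Q + alpha * delta * z                            # value update along the (old) trace
    z = lam_bin * z + (1 - alpha * lam_bin * (z @ x)) * x  # Dutch trace update
    R_bar = (1 - eta) * R_bar + eta * r                  # running average reward
    return Q, z, R_bar
```

Note the ordering: both the value update and the trace update use the trace from bin \(b\), matching the subscripts in the equations above.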

Action Selection

The decision to respond is modeled with a vigor function of the time since the last press (\(t_{\text{diff}}\)) and the total action value: \[ \text{logit}(p_{\text{respond}}) = \beta_0 + \beta_t \, t_{\text{diff}} + \beta_q \, (Q_x + Q_y). \] Responding is prohibited during a refractory period \(\phi\), which is an explicit free parameter (see refractory_s in the table below) rather than an emergent property of the vigor equation.

Conditional on responding, the agent samples an action from a sticky softmax: \[ m_b(a) = \kappa Q_b(a) + \omega \, \mathbb{I}[a = a_{\text{last}}], \] \[ P(a) = \frac{\exp(m_b(a))}{\sum_j \exp(m_b(j))}. \] The inverse-temperature \(\kappa\) tunes explore/exploit balance, while \(\omega\) biases the agent toward repeating the previous action.
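Taken together with the vigor stage, one bin's action selection can be sketched as follows (an illustrative Python translation; select_action and all names are mine, and the hard refractory check implements the refractory_s parameter described above):

```python
import numpy as np

rng = np.random.default_rng(0)

def select_action(Q, t_diff, a_last, b0, bt, bq, kappa, omega, refractory_s):
    """Two-stage choice: (1) respond or not via the vigor logit,
    (2) if responding, pick an action from a sticky softmax."""
    if t_diff < refractory_s:                    # hard refractory: no response allowed
        return None
    logit = b0 + bt * t_diff + bq * Q.sum()      # vigor: time since press + total value
    p_respond = 1.0 / (1.0 + np.exp(-logit))
    if rng.random() >= p_respond:
        return None                              # no response this bin
    m = kappa * Q                                # value-scaled preferences
    if a_last is not None:
        m = m.copy()
        m[a_last] += omega                       # stickiness bonus for repeating
    p = np.exp(m - m.max())
    p /= p.sum()                                 # softmax over actions {x, y}
    return int(rng.choice(len(Q), p=p))
```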

Initial Comparison of Response Probability Functions

In the earlier white paper, I conceived of response probability as an exponential recovery function:

\[ \begin{align} p(\textrm{respond}_{t,b}) &= 1 - \textrm{exp}\left[ - (\nu + \gamma Q^*)(\tau - \textrm{rt}_\textrm{last}) \right] \\[0.3cm] Q^*_{t,b} &= \sum_{a=1}^{2} Q_{t,a,b} \end{align} \]

In the current code/parameterization, this has switched to a logit hazard model. I need to think more about this change, but here is an initial AI perspective on it:

The current simulator/Stan code borrowed the logistic hazard because it made earlier vigor parameterizations compatible with existing binary-response datasets.

  • Strengths of the logit hazard:
    1. It plugs straight into Bernoulli GLM machinery (easy priors, inv-logit link).
    2. It naturally accommodates covariates such as t_since, Q*, or subject-level random effects.
    3. It combines cleanly with a hard refractory period: simply set the Bernoulli probability to zero during locked-out bins.
  • Weaknesses of the logit hazard:
    1. It lacks the “zero immediately after a response” property unless you enforce it manually (the added refractory flag is a workaround).
    2. Interpreted continuously, the logit link does not correspond to a well-defined hazard rate; it is fundamentally a discrete-time choice model rather than a renewal process.
    3. It introduces a separate refractory parameter that partly duplicates what an exponential hazard would already encode.

The exponential recovery form is a continuous-time hazard model derived from a renewal process: the hazard rate \(\nu + \gamma Q^*\) is constant after a response, so the probability of having responded by elapsed time \(\tau - \text{rt}_\text{last}\) grows as \(1 - e^{-(\nu+\gamma Q^*)(\tau - \text{rt}_\text{last})}\).

  • Strengths of exponential recovery:
    1. Behaviorally grounded: the probability is zero at the moment of a response and rises smoothly with elapsed time, without needing an explicit refractory flag.
    2. Clean separation between baseline vigor (ν) and value modulation (γ) in the rate matches the intuition that higher total value speeds up recovery.
    3. Parameters retain interpretable “rate” units and align with the white paper spec.
  • Weaknesses of the exponential form:
    1. Harder to extend with extra predictors (e.g., nuisance regressors) because everything enters through the rate term; heterogeneous effects might require more elaborate hazard parameterizations.
    2. Constraining \(\nu + \gamma Q^*\) to stay positive demands careful priors or transformations in Stan.
    3. If responses occur in discrete bins, the discretized probability is \(1 - e^{-\lambda \Delta t}\); mapping that to observed Bernoulli data requires consistent bin widths and may be less robust to variable time steps.

In short, the logistic form was chosen for implementation convenience and flexibility with discrete-time Bernoulli modeling, but if the theoretical commitment is to a recovery process with physically meaningful rates, switching back to the exponential hazard is well motivated. The exponential form requires re-deriving the Bernoulli probability per bin (from the integrated hazard) and enforcing positivity constraints on ν and γ, but it eliminates the ad hoc refractory period and aligns the code with the original white paper.
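For concreteness, the two per-bin response probabilities can be computed side by side as follows (a hedged Python sketch; the softplus positivity transform is one possible choice, not something the white paper or code specifies):

```python
import math

def softplus(x):
    """Smooth positivity transform, one option for constraining the rate."""
    return math.log1p(math.exp(x))

def p_bin_exponential(nu, gamma, q_star, dt):
    """Per-bin Bernoulli probability implied by a constant-rate exponential
    hazard, integrated over one bin of width dt (seconds)."""
    rate = softplus(nu + gamma * q_star)   # assumed transform; keeps rate > 0
    return 1.0 - math.exp(-rate * dt)

def p_bin_logit(b0, bt, bq, t_diff, q_star):
    """Per-bin probability under the current logistic hazard."""
    logit = b0 + bt * t_diff + bq * q_star
    return 1.0 / (1.0 + math.exp(-logit))
```

Note that the exponential form makes the bin width explicit: for small rates, halving dt roughly halves the per-bin probability, whereas the logit form has no such dependence on dt.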

Reward Scheduling

Each response can start a reward “job” with action-specific probability \(p_{\text{trigger}}(a)\). Rewards are delivered after an exponentially distributed delay with mean \(\tau_{\text{delay}}(a)\) and magnitude \(R(a)\). Dutch traces distribute credit backward through time, enabling the model to learn despite the delays.
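The scheduling logic can be sketched as follows (an illustrative Python version; function and variable names are my own, not those in eligibility_test.R):

```python
import random

def maybe_schedule_reward(action, bin_idx, p_trigger, delay_mean_s,
                          reward_amount, dt, pending):
    """With action-specific probability p_trigger[action], schedule a reward
    of reward_amount[action] after an exponentially distributed delay with
    mean delay_mean_s[action], stored as a future bin index."""
    if random.random() < p_trigger[action]:
        delay = random.expovariate(1.0 / delay_mean_s[action])
        deliver_bin = bin_idx + int(delay / dt)
        pending.append((deliver_bin, reward_amount[action]))

def collect_rewards(bin_idx, pending):
    """Sum rewards whose delivery bin has arrived; keep the rest pending."""
    due = sum(amt for b, amt in pending if b <= bin_idx)
    pending[:] = [(b, amt) for b, amt in pending if b > bin_idx]
    return due
```

Because delivery can land many bins after the triggering response, the Dutch traces above are what connect the eventual reward back to the action that scheduled it.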

Parameter Summary

Key model parameters and interpretations.
| Parameter | Interpretation |
| --- | --- |
| alpha | Learning rate for action values (per bin) |
| eta | Learning rate for the average-reward baseline |
| tau_e | Eligibility time constant (s) |
| beta0_vigor | Baseline logit for responding |
| beta_t_vigor | Effect of time since last response on vigor |
| beta_q_vigor | Effect of total value (Qx + Qy) on vigor |
| refractory_s | Hard refractory period (s) |
| kappa | Inverse temperature for softmax (explore/exploit) |
| omega_sticky | Stickiness weight favoring the previous action |
| p_trigger | Action-specific probability that a response schedules a reward |
| delay_mean_s | Mean delay before reward delivery (s) |
| reward_amount | Reward magnitude per action |

Scenario Overview

The simulator in eligibility_test.R defines a set of illustrative scenarios that vary key learning and control parameters. Each scenario is simulated for 4,000 bins (200 s) unless otherwise noted, and the table below summarizes core behavioral metrics.

Behavioral summary across scenarios (responses per second, action preference, reward rate, and mean TD error).
| scenario | description | responses_per_s | choice_x | reward_rate | avg_delta |
| --- | --- | --- | --- | --- | --- |
| Baseline symmetric | Balanced environment with default parameters. | 0.600 | 0.258 | 0.280 | 0.001 |
| Pure exploration | Low kappa without stickiness keeps choices balanced under symmetry. | 0.525 | 0.371 | 0.245 | 0.001 |
| Sticky streaks | Low kappa plus strong stickiness produces long streaks on the same action. | 0.605 | 0.157 | 0.280 | 0.001 |
| Deterministic symmetric | High kappa with zero stickiness in a symmetric task locks in after early noise. | 0.640 | 0.992 | 0.310 | 0.001 |
| Asymmetric payoffs | High kappa plus richer X rewards yields strong exploitation. | 0.880 | 0.989 | 0.752 | 0.003 |
| Delayed rewards, short trace | Long reward delays with short eligibility (tau_e) hinder credit assignment. | 0.425 | 0.412 | 0.240 | 0.000 |
| Long eligibility | Longer traces (tau_e) and slower learning spread credit over time. | 0.505 | 0.416 | 0.265 | 0.001 |
| Fast baseline | Large eta lets the average-reward baseline track payoffs quickly. | 0.465 | 0.892 | 0.245 | 0.000 |
| Slow baseline | Tiny eta keeps the baseline sluggish, yielding larger delta swings. | 0.575 | 0.896 | 0.265 | 0.005 |
| High vigor | Elevated vigor parameters yield denser responding despite refractory limits. | 5.845 | 0.010 | 2.950 | 0.004 |
| Hard refractory | Half-second refractory windows create pronounced pauses between responses. | 0.365 | 0.658 | 0.180 | 0.000 |
| Weak vigor | Lower vigor intercept/slope suppresses response initiation. | 0.110 | 0.364 | 0.075 | 0.000 |

Scenario Walkthrough

The following sections document each scenario in detail, including the motivation, parameter adjustments relative to the defaults, observed metrics, and the full set of diagnostic plots produced by plot_dynamics().

Baseline symmetric

Balanced environment with default parameters.

Pure exploration

Low kappa without stickiness keeps choices balanced under symmetry.

Sticky streaks

Low kappa plus strong stickiness produces long streaks on the same action.

Deterministic symmetric

High kappa with zero stickiness in a symmetric task locks in after early noise.

Asymmetric payoffs

High kappa plus richer X rewards yields strong exploitation.

Delayed rewards, short trace

Long reward delays with short eligibility (tau_e) hinder credit assignment.

Long eligibility

Longer traces (tau_e) and slower learning spread credit over time.

Fast baseline

Large eta lets the average-reward baseline track payoffs quickly.

Slow baseline

Tiny eta keeps the baseline sluggish, yielding larger delta swings.

High vigor

Elevated vigor parameters yield denser responding despite refractory limits.

Hard refractory

Half-second refractory windows create pronounced pauses between responses.

Weak vigor

Lower vigor intercept/slope suppresses response initiation.